Abstract

This paper addresses the problem of inferring circulation of information between multiple stochastic processes. 
We discuss two possible frameworks in which the problem can be studied: directed information theory and Granger 
causality. The main goal of the paper is to study the connection between these two frameworks. In the case of directed 
information theory, we stress the importance of Kramer's causal conditioning. This type of conditioning is necessary 
not only in the definition of the directed information but also for handling causal side information. We also show how 
directed information decomposes into the sum of two measures: the first one, related to Schreiber's transfer entropy, 
quantifies the dynamical aspects of causality, whereas the second one, termed instantaneous information exchange, 
quantifies the instantaneous aspect of causality. After recalling the definition of Granger causality, we establish 
its connection with directed information theory. The connection is studied in particular in the Gaussian case, showing 
that Geweke's measures of Granger causality correspond to the transfer entropy and the instantaneous information 
exchange. This allows us to propose an information-theoretic formulation of Granger causality. 

Keywords: directed information, transfer entropy, Granger causality, graphical models 

I. Introduction 

The importance of the network paradigm for the analysis of complex systems, in fields ranging from biology 
and sociology to communication theory and computer science, has recently given rise to new research 
interests referred to as network science or complex networks [9], [18], [50]. Characterizing the interactions between 
the nodes of such a network is a major issue for understanding its global behavior and identifying its topology. It 
is customarily assumed that nodes may be observed via the recording of (possibly multivariate) time series at each 
of them, modeled as realizations of stochastic processes (see [60], [61] for examples in biology, or [43], [66], [12], 
[35] for applications in neuroscience). The assessment of an interaction between two nodes is then formulated as 
an interaction detection/estimation problem between their associated time series. Determining the existence of edges 
between given nodes (or vertices) of a graph may be reformulated in a graphical modeling inference framework [74], 
[38], [15], [54], [39]. Describing connections in a graph requires a definition of the interactions that will 
be carried by the edges connecting the nodes. Connectivity receives different interpretations in the neuroscience 
literature, for instance, depending on whether it is 'functional', revealing some dependence, or 'effective', in the 
sense that it accounts for directivity [29], [66]. This differentiation in the terms describing connectivity raises the 
crucial issue of causality, which goes beyond the problem of simply detecting the existence or the strength of an edge 
linking two nodes. 

Detecting whether a connection between two nodes can be given a direction or two can be addressed by identifying 
possible 'master-slave' relationships between nodes. Based on the measurements of two signals $x_t$ and $y_t$, the 
question is: 'Does $x_t$ influence $y_t$ more than $y_t$ influences $x_t$?'. Addressing this problem requires the introduction 
of tools that account for asymmetries in the information exchanged between the signals. 

Granger and others investigated this question using the concept of causality [24], [25], [20], [54] and emphasized 
that interaction between two processes is relative to the set of observed nodes. Actually, the possible interactions of 
the studied pair of nodes with other nodes from the network may profoundly alter the estimated type of connectivity. 
This leads to fundamental limitations of pairwise approaches for multiply connected network studies. Many authors 
addressed the topic of inferring causal relationships between interacting stochastic systems under the restriction 
of linear/Gaussian assumptions. In [20], [21] the author develops a general linear modeling approach in the time 
domain. A spectral domain definition of causal connectivity is proposed in [31], whose relationship with Granger 
causality is explored in [17]. However, all these techniques need to be extended or revisited to tackle nonlinearity 
and/or non-Gaussianity. 

Information-theoretic tools provide a means to go beyond Gaussianity. Mutual information characterizes the 
information exchanged between stochastic processes [13], [56]. It is however a symmetric measure and does not 
provide any insight on possible directionality. Many authors have managed to modify mutual information in order 
to obtain asymmetrical measures. These are for example Saito and Harashima's transinformation [63], [32], [1], 
the coarse grained transinformation proposed by Palus et al. [51], [52], Schreiber's transfer entropy [64], [30]. All 
these measures share common roots which are revealed using directed information theory. 

A. Main contributions of the paper 

This paper is an attempt to make sense of and to systematize the various definitions and measures of causal 
dependence that have been proposed to date. Actually, we claim that these measures can be reduced to directed 
information, with or without additional causal conditioning. Directed information, introduced by Massey in 1990 
[46] and based on the earlier results of Marko's bidirectional information theory [45], is shown to be an adequate 
quantity to address the topic of causal conditioning within an information theoretic framework. Kramer, Tatikonda 
and others have used directed information to study communication problems in systems with feedback [34], [67], 
[71], [68]. Although their work aimed at developing new bounds on the capacity of channels with feedback and 
optimizing directed information, most of their results allow better insight into causal connectivity problems for systems 
that may exhibit feedback. 

Massey's directed information will be extensively used to quantify directed information flow between stochastic 
processes. We show how directed information, which is intimately linked to feedback, answers the 
question of characterizing directional influence between processes in a fully general framework. A contribution of 
this paper is to describe the link between Granger causality and directed information theory, both in the bivariate and 
multivariate cases. It is shown that causal conditioning plays a key role, as its main measure, directed information, 
can be used to assess causality, instantaneous coupling and feedback in graphs of stochastic processes. A main 
contribution is then a reformulation of Granger causality in terms of directed information theoretic concepts. 

B. Organization of the paper 

As outlined in the preceding sections, directed information plays a key role in defining information flows in 
networks [45], [63], [32], [46], [34], [67], [68], [60], [61], [65], [3], [5]. Section II gives a formal development of 
directed information following earlier works of Massey, Kramer and Tatikonda [46], [34], [67]. Feedback in the 
definition of directed information is revisited, together with its relation to Kramer's causal conditioning [34]. This 
paper extends these latter ideas and shows that causally conditioned directed information is a means of measuring 
directed information in networks: it actually accounts for the existence of other nodes interacting with those studied. 
The link between directed information and transfer entropy [64] established in this section is a contribution of the 
paper. In Section III we present Granger causality, which relies on forward prediction. We particularly insist on the 
case of multivariate time series. Section IV is devoted to developing the connection between the present information 
theoretic framework and Granger causality. Although all results hold in a general framework explained in Section 
IV-C, particular attention is given to the Gaussian case. In this case directed information theory and Granger 
causality are shown to lead to equivalent tools to assess directional dependencies (see also [5]). This extends 
similar recent results independently obtained by Barnett et al. [8] in the case of two interacting signals without 
instantaneous interaction. An enlightening illustration of the interactions between three time series is presented in 
Section V for a particular Gaussian model. 

II. Measuring directional dependence 

A. Notations and basics 

Throughout the paper we consider discrete-time, finite-variance ($E[|x|^2] < +\infty$) stochastic processes. Time samples 
are indexed by $\mathbb{Z}$; $x_k^n$ stands for the vector $(x_k, x_{k+1}, \ldots, x_n)$, whereas for $k = 1$ the index will be omitted for the 
sake of readability. Thus we identify the time series $\{x(k), k = 1, \ldots, n\}$ with the vector $x^n$. $E_x[\cdot]$ will denote the 
expectation with respect to the probability measure describing $x$, whereas $E_p[\cdot]$ will indicate that the expectation 
is taken with respect to the probability distribution $p$. 

Throughout the paper, the random variables (vectors) considered are either purely discrete, or continuous with the added 
assumption that the probability measure is absolutely continuous with respect to the Lebesgue measure. Therefore, 
all derivations hereafter are valid in either case. Note however that the existence of limits will in general be assumed 
when necessary and not proved. 

Let $H(x^n) = -E_x[\log p(x^n)]$ be the entropy of a random vector $x^n$ whose density is $p$. Let the conditional 
entropy be defined as $H(x^n|y^n) = -E[\log p(x^n|y^n)]$. The mutual information $I(x^n; y^n)$ between vectors $x^n$ and 
$y^n$ is defined as [13]: 

$$I(x^n; y^n) = H(y^n) - H(y^n|x^n) = D_{KL}\big(p(x^n, y^n) \,\|\, p(x^n) p(y^n)\big) \tag{1}$$

where $D_{KL}(p\|q) = E_p[\log p(x)/q(x)]$ is the Kullback-Leibler divergence. It is 0 if and only if $p = q$ almost 
everywhere and is positive otherwise. The mutual information effectively measures independence since it is 0 if and 
only if $x^n$ and $y^n$ are independent random vectors. As $I(x^n; y^n) = I(y^n; x^n)$, mutual information cannot handle 
directional dependence. 
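As a quick numerical check of eq. (1) and of the symmetry just noted, the following minimal sketch evaluates the mutual information of a small discrete joint distribution. The joint table `pxy` and the function name `mutual_information` are illustrative assumptions, not part of the original development.

```python
import numpy as np

def mutual_information(pxy):
    """I(x;y) in bits, computed as D_KL(p(x,y) || p(x)p(y)) for a joint pmf table."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x (rows)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y (columns)
    mask = pxy > 0                        # convention 0 log 0 = 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])).sum())

# Hypothetical joint pmf of two binary variables (rows: x, columns: y).
pxy = np.array([[0.50, 0.10],
                [0.05, 0.35]])

mi_xy = mutual_information(pxy)     # I(x;y)
mi_yx = mutual_information(pxy.T)   # I(y;x): same value, roles exchanged
```

Exchanging the roles of the two variables (transposing the table) leaves the value unchanged, which is why a different quantity is needed to capture direction.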

Let $z^n$ be a third time series. It may be a multivariate process accounting for side information (all available 
observations other than $x^n$ and $y^n$). To account for $z^n$, the conditional mutual information is introduced: 

$$I(x^n; y^n|z^n) = E_z\big[D_{KL}\big(p(x^n, y^n|z^n) \,\|\, p(x^n|z^n) p(y^n|z^n)\big)\big] \tag{2}$$
$$= D_{KL}\big(p(x^n, y^n, z^n) \,\|\, p(x^n|z^n) p(y^n|z^n) p(z^n)\big) \tag{3}$$

$I(x^n; y^n|z^n)$ is zero if and only if $x^n$ and $y^n$ are independent conditionally on $z^n$. Stated differently, conditional 
mutual information measures the divergence between the actual observations and those which would be observed 
under the Markov assumption ($x \to z \to y$). Arrows may be misleading here, as by reversibility of Markov chains, the 
equality above holds also for ($y \to z \to x$). This again emphasizes the inability of mutual information to provide 
answers to the information flow directivity problem. 
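The Markov-chain reading of conditional mutual information can be illustrated with a short sketch. It builds a joint distribution that satisfies $x \to z \to y$ by construction and evaluates both the unconditional and the conditional mutual information. The numerical tables and the helper names `entropy` and `cmi` are assumptions chosen for illustration.

```python
import numpy as np

def entropy(p, keep):
    """Entropy (bits) of the marginal of the joint pmf p over the axes in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = np.asarray(p.sum(axis=drop) if drop else p).ravel()
    nz = marginal[marginal > 0]
    return float(-(nz * np.log2(nz)).sum())

def cmi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of axes."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

# Hypothetical Markov chain x -> z -> y on binary variables.
px = np.array([0.3, 0.7])
pz_given_x = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: x, columns: z
py_given_z = np.array([[0.7, 0.3], [0.1, 0.9]])   # rows: z, columns: y

# Joint pmf with axes ordered (x, z, y).
p = px[:, None, None] * pz_given_x[:, :, None] * py_given_z[None, :, :]

i_xy = cmi(p, (0,), (2,))                 # I(x;y) > 0: x and y are dependent
i_xy_given_z = cmi(p, (0,), (2,), (1,))   # I(x;y|z) = 0 under the Markov assumption
```

The same zero would be obtained for the reversed chain $y \to z \to x$, since the joint distribution is identical.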

B. Directed information 

Directed information was introduced by Massey [46], based on the previous concept of "bidirectional information" 
of Marko [45]. Bidirectional information focuses on the two nodes problem, but accounts for the respective roles 
of feedback and memory in the information flow. 

1) Feedback and memory: Massey [46] noted that the joint probability distribution $p(x^n, y^n)$ can be written as 
a product of two terms: 

$$\overleftarrow{p}(x^n|y^{n-1}) = \prod_{i=1}^{n} p(x_i|x^{i-1}, y^{i-1})$$

$$\overrightarrow{p}(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x^{i}, y^{i-1})$$

$$p(x^n, y^n) = \overleftarrow{p}(x^n|y^{n-1})\, \overrightarrow{p}(y^n|x^n) \tag{4}$$

where for $i = 1$ the first terms are respectively $p(x_1)$ and $p(y_1|x_1)$. Assuming that $x$ is the input of a system that 
creates $y$, $\overleftarrow{p}(x^n|y^{n-1})$ can be viewed as a characterization of feedback in the system. Hence the name feedback 
factor: each of the factors controls the probability of the input $x$ at time $i$ conditionally on its past and on the past 
values of the output $y$. Likewise, the term $\overrightarrow{p}(y^n|x^n)$ will be referred to as the feedforward factor. The factorization 
leads to some remarks: 

• In the absence of feedback in the link from $x$ to $y$, one has 

$$p(x_i|x^{i-1}, y^{i-1}) = p(x_i|x^{i-1}) \quad \forall i \geq 2 \tag{5}$$

or equivalently 

$$H(x_i|x^{i-1}, y^{i-1}) = H(x_i|x^{i-1}) \quad \forall i \geq 2 \tag{6}$$

As a consequence: 

$$\overleftarrow{p}(x^n|y^{n-1}) = p(x^n) \tag{7}$$

• If the feedforward factor does not depend on the past, the link is memoryless: 

$$p(y_i|x_i) = p(y_i|x^{i}, y^{i-1}) \quad \forall i \geq 1 \tag{8}$$

• Let $D$ be the unit delay operator, such that $Dy_n = y_{n-1}$. We define $Dy^n = (0, y_1, y_2, \ldots, y_{n-1})$ for finite 
length sequences, in order to deal with edge effects while maintaining constant dimension for the studied time 
series$^1$. Then we have 

$$\overrightarrow{p}(x^n|Dy^n) = \overleftarrow{p}(x^n|y^{n-1}) \tag{9}$$

The feedback term in the link $x \to y$ is the feedforward term of the delayed sequence in the link $y \to x$. 
2) Causal conditioning and directed information: In [34], Kramer introduced an original point of view, based 
upon the following remark. The conditional entropy is easily expanded (using Bayes' rule) according to 

$$H(y^n|x^n) = \sum_{i=1}^{n} H(y_i|y^{i-1}, x^n) \tag{10}$$

where each term in the sum is the conditional entropy of $y$ at time $i$ given its past and the whole observation of 
$x$: causality (if any) in the dynamics $x \to y$ is thus not taken into account. Assuming that $x$ influences $y$ through 
some unknown process, Kramer proposed that the conditioning of $y$ at time $i$ should include $x$ from the initial time 
up to time $i$ only. He named this causal conditioning, and defined the causal conditional entropy as 

$$H(y^n\|x^n) = \sum_{i=1}^{n} H(y_i|y^{i-1}, x^{i}) \tag{11}$$

By plugging the causal conditional entropy into the expression of mutual information in place of the conditional entropy, 
we obtain a definition of directed information: 

$$I(x^n \to y^n) = H(y^n) - H(y^n\|x^n) = \sum_{i=1}^{n} I(x^{i}; y_i|y^{i-1}) \tag{12}$$
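To make the definition concrete, the sketch below evaluates the sum in eq. (12) for sequences of length $n = 2$ over binary alphabets. The toy model (a uniform independent input $x$, a noise-only first output $y_1$, and $y_2$ equal to $x_1$ flipped with probability 0.1) and all function names are assumptions made for illustration.

```python
import numpy as np

def entropy(p, keep):
    """Entropy (bits) of the marginal of the joint pmf p over the axes in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = np.asarray(p.sum(axis=drop) if drop else p).ravel()
    nz = marginal[marginal > 0]
    return float(-(nz * np.log2(nz)).sum())

def cmi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of axes."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def directed_information(p, src, dst):
    """I(src^n -> dst^n) = sum_i I(src^i; dst_i | dst^{i-1}), as in eq. (12)."""
    return sum(cmi(p, src[:i + 1], dst[i:i + 1], dst[:i]) for i in range(len(dst)))

# Axes of the joint pmf: (x1, x2, y1, y2).
X, Y = (0, 1), (2, 3)

# Toy model: x1, x2, y1 uniform and independent; y2 = x1 flipped with probability 0.1.
channel = np.array([[0.9, 0.1], [0.1, 0.9]])      # p(y2 | x1)
p = 0.125 * np.broadcast_to(channel[:, None, None, :], (2, 2, 2, 2))

di_xy = directed_information(p, X, Y)   # x drives y with a one-sample delay
di_yx = directed_information(p, Y, X)   # nothing flows from y to x
```

In this model the directed information from $x$ to $y$ equals $1 - h(0.1)$ bits, with $h$ the binary entropy function, while the directed information from $y$ to $x$ vanishes; mutual information alone would not reveal this asymmetry.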

$^1$ The term 0 in $Dy^n = (0, y_1, y_2, \ldots, y_{n-1})$ indicates a wild card which has no influence on conditioning, and makes sense as $y_0$ is not 
assumed observed. 

Alternately, Tatikonda's work [67] leads to expressing directed information as a Kullback-Leibler divergence: 

$$I(x^n \to y^n) = D_{KL}\big(p(x^n, y^n) \,\|\, \overleftarrow{p}(x^n|y^{n-1})\, p(y^n)\big) \tag{13}$$

$$= E\left[\log \frac{p(x^n, y^n)}{\overleftarrow{p}(x^n|y^{n-1})\, p(y^n)}\right]$$

$$= E\left[\log \frac{\overrightarrow{p}(y^n|x^n)}{p(y^n)}\right] \tag{14}$$

The expression (13) highlights the importance of the feedback term when comparing mutual information with 
directed information: $p(x^n)$ in the expression of the mutual information is replaced by the feedback factor 
$\overleftarrow{p}(x^n|y^{n-1})$ in the definition of directed information. 

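This result allows many (in)equalities to be derived rapidly. First, as a divergence, the directed information 
is always nonnegative. Then, multiplying and dividing by $p(x^n)$ inside the logarithm gives 

$$I(x^n \to y^n) = E\left[\log\left(\frac{p(x^n, y^n)}{\overleftarrow{p}(x^n|y^{n-1})\, p(y^n)} \times \frac{p(x^n)}{p(x^n)}\right)\right]$$

Using equations (9) and (14) we get 

$$I(Dy^n \to x^n) = -E\left[\log \frac{p(x^n)}{\overleftarrow{p}(x^n|y^{n-1})}\right]$$

Substituting this result into the expression above, we obtain 

$$I(x^n \to y^n) = I(x^n; y^n) + E\left[\log \frac{p(x^n)}{\overleftarrow{p}(x^n|y^{n-1})}\right] \tag{15}$$

$$= I(x^n; y^n) - \sum_{i} I(x_i; y^{i-1}|x^{i-1}) \tag{16}$$

$$= I(x^n; y^n) - I(Dy^n \to x^n) \tag{17}$$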

Equation (17) is fundamental as it shows how mutual information splits into the sum of a feedforward information 
flow $I(x^n \to y^n)$ and a feedback information flow $I(Dy^n \to x^n)$. In the absence of feedback, $\overleftarrow{p}(x^n|y^{n-1}) = p(x^n)$ 
and $I(x^n; y^n) = I(x^n \to y^n)$. Equation (16) shows that the mutual information is always greater than or equal to the directed 
information, since $I(Dy^n \to x^n) = \sum_i I(x_i; y^{i-1}|x^{i-1}) \geq 0$. As a sum of nonnegative terms, it is zero if and only if 
all the terms are zero: 

$$I(x_i; y^{i-1}|x^{i-1}) = 0 \quad \forall i = 2, \ldots, n$$

or equivalently 

$$H(x_i|x^{i-1}, y^{i-1}) = H(x_i|x^{i-1}) \quad \forall i = 2, \ldots, n \tag{18}$$



This last equation states that without feedback, the past of $y$ does not influence the present of $x$ when conditioned 
on its own past. Alternately, one sees that if eq. (18) holds, then the sequence $y^{i-1} \to x^{i-1} \to x_i$ forms a Markov 
chain, for all $i$: again, the conditional probability of $x$ given its past does not depend on the past of $y$. Equalities 
(18) can be considered as a definition of the absence of feedback from $y$ to $x$. All these findings are summarized in 
the following theorem: 

$^2$ The proofs rely on the use of the chain rule $I(X, Y; Z) = I(Y; Z|X) + I(X; Z)$ in the definition of the directed information. 

Theorem: ([46] and [47]) The directed information is less than or equal to the mutual information, with equality 
if and only if there is no feedback. 

This theorem implies that mutual information over-estimates the directed information between two processes 
in the presence of feedback. This was thoroughly studied in [34], [67], [71], [68], in a communication theoretic 
framework. 
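The conservation law (17) and the theorem above can be checked numerically on an arbitrary joint distribution. The sketch below does so for length-2 binary sequences; the randomly drawn joint pmf and the helper names are assumptions for illustration only.

```python
import numpy as np

def entropy(p, keep):
    """Entropy (bits) of the marginal of the joint pmf p over the axes in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = np.asarray(p.sum(axis=drop) if drop else p).ravel()
    nz = marginal[marginal > 0]
    return float(-(nz * np.log2(nz)).sum())

def cmi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of axes."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def directed_information(p, src, dst):
    """I(src^n -> dst^n) = sum_i I(src^i; dst_i | dst^{i-1})."""
    return sum(cmi(p, src[:i + 1], dst[i:i + 1], dst[:i]) for i in range(len(dst)))

def delayed_directed_information(p, src, dst):
    """I(D src^n -> dst^n) = sum_i I(src^{i-1}; dst_i | dst^{i-1})."""
    return sum(cmi(p, src[:i], dst[i:i + 1], dst[:i]) for i in range(len(dst)))

X, Y = (0, 1), (2, 3)                    # axes of (x1, x2) and (y1, y2)
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()                             # arbitrary joint pmf, feedback generally present

mi = cmi(p, X, Y)                                  # I(x^n; y^n)
feedforward = directed_information(p, X, Y)        # I(x^n -> y^n)
feedback = delayed_directed_information(p, Y, X)   # I(D y^n -> x^n)
```

Up to rounding, `mi` equals `feedforward + feedback`, and `feedforward` never exceeds `mi`.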

Summing the information flows in opposite directions gives: 

$$I(x^n \to y^n) + I(y^n \to x^n) = E\left[\log \frac{p(x^n, y^n)}{\overleftarrow{p}(x^n|y^{n-1})\, p(y^n)}\right] + E\left[\log \frac{p(x^n, y^n)}{\overleftarrow{p}(y^n|x^{n-1})\, p(x^n)}\right]$$

$$= I(x^n; y^n) + E\left[\log \frac{\overrightarrow{p}(y^n|x^n)}{\overleftarrow{p}(y^n|x^{n-1})}\right]$$

$$= I(x^n; y^n) + I(x^n \to y^n\|Dx^n) \tag{19}$$

where 

$$I(x^n \to y^n\|Dx^n) = \sum_{i} I(x^{i}; y_i|y^{i-1}, x^{i-1}) = \sum_{i} I(x_i; y_i|y^{i-1}, x^{i-1}) \tag{20}$$

This proves that $I(x^n \to y^n) + I(y^n \to x^n)$ is symmetrical but is in general not equal to the mutual information, 
with equality if and only if $I(x_i; y_i|y^{i-1}, x^{i-1}) = 0$, $\forall i = 1, \ldots, n$. Since the term in the sum is the mutual information 
between the present samples of the two processes conditioned on their joint past values, this measure is a measure of 
instantaneous dependence. The term $I(x^n \to y^n\|Dx^n) = I(y^n \to x^n\|Dy^n)$ will thus be named the instantaneous 
information exchange between $x$ and $y$. 

C. Directed information rates 

Entropy as well as mutual information are extensive quantities, increasing (in general) linearly with the length $n$ 
of the recorded time series. Shannon's information rate for stochastic processes compensates for the linear growth by 
considering $A_\infty(x) = \lim_{n \to +\infty} A_n(x)/n$ (if the limit exists), where $A_n(x)$ denotes any information measure on 
the sample $x$ of length $n$. 

For the important class of stationary processes (see e.g. [13]) the entropy rate turns out to be the limit of the 
conditional entropy: 

$$\lim_{n \to +\infty} \frac{1}{n} H(x^n) = \lim_{n \to +\infty} H(x_n|x^{n-1}) \tag{21}$$

Kramer generalized this result for causal conditional entropies, thus defining the directed information rate for 
stationary processes as 

$$I_\infty(x \to y) = \lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} I(x^{i}; y_i|y^{i-1})$$

$$= \lim_{n \to +\infty} I(x^n; y_n|y^{n-1}) \tag{22}$$

This result holds also for the instantaneous information exchange rate. Note that the proof of the result relies on 
the positivity of the entropy for discrete valued stochastic processes. For continuously valued processes, for which 
entropy can be negative, the proof is more involved and requires the methods developed in [56], [26], [27], see 
also [68]. 
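Eq. (21) can be illustrated on a stationary two-state Markov chain, for which the conditional entropy $H(x_n|x^{n-1})$ reduces to $H(x_n|x_{n-1})$. The sketch below enumerates all sequences of length $n$ to obtain $H(x^n)$ exactly and compares $H(x^n)/n$ with the conditional entropy. The transition matrix `P` is a hypothetical choice.

```python
import numpy as np

# Hypothetical transition matrix of a two-state Markov chain (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P associated with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def h(q):
    """Entropy (bits) of a pmf given as a 1-D array."""
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# Right-hand side of eq. (21): H(x_n | x_{n-1}) under the stationary law.
cond_entropy = float(sum(pi[s] * h(P[s]) for s in range(2)))

def block_entropy(n):
    """Exact H(x^n), obtained by enumerating the 2**n sequence probabilities."""
    probs, last = pi.copy(), np.arange(2)
    for _ in range(n - 1):
        probs = (probs[:, None] * P[last, :]).ravel()   # append one sample to each sequence
        last = np.tile(np.arange(2), len(last))         # final state of each extended sequence
    return h(probs)

rates = [block_entropy(n) / n for n in range(1, 13)]     # H(x^n)/n for n = 1..12
```

The sequence `rates` approaches `cond_entropy` as $n$ grows, at a speed of order $1/n$.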

D. Transfer entropy and instantaneous information exchange 

Introduced by Schreiber in [64], [30], transfer entropy evaluates the deviation of the observed data from a model 
assuming the following joint Markov property 

$$p(y_n|y_{n-k+1}^{n-1}, x_{n-l+1}^{n-1}) = p(y_n|y_{n-k+1}^{n-1}) \tag{23}$$

This leads to the following definition 

$$T(x_{n-l+1}^{n-1} \to y_{n-k+1}^{n}) = E\left[\log \frac{p(y_n|y_{n-k+1}^{n-1}, x_{n-l+1}^{n-1})}{p(y_n|y_{n-k+1}^{n-1})}\right] \tag{24}$$

Then $T(x_{n-l+1}^{n-1} \to y_{n-k+1}^{n}) = 0$ if and only if eq. (23) is satisfied. Although in the original definition the past of $x$ in the 
conditioning may begin at a different time $m \neq n$, for practical reasons $m = n$ is considered. Actually, no a priori knowledge 
is available about possible delays, and setting $m = n$ allows the transfer entropy to be compared with the directed 
information. 

By expressing the transfer entropy as a difference of conditional entropies, we get 

$$T(x_{n-l+1}^{n-1} \to y_{n-k+1}^{n}) = H(y_n|y_{n-k+1}^{n-1}) - H(y_n|y_{n-k+1}^{n-1}, x_{n-l+1}^{n-1})$$

$$= I(x_{n-l+1}^{n-1}; y_n|y_{n-k+1}^{n-1}) \tag{25}$$

For $l = n = k$, the identity $I(x, y; z|w) = I(x; z|w) + I(y; z|x, w)$ leads to 

$$I(x^n; y_n|y^{n-1}) = I(x^{n-1}; y_n|y^{n-1}) + I(x_n; y_n|x^{n-1}, y^{n-1})$$

$$= T(x^{n-1} \to y^n) + I(x_n; y_n|x^{n-1}, y^{n-1}) \tag{26}$$

For stationary processes, letting $n \to \infty$ and provided the limits exist, we obtain for the rates 

$$I_\infty(x \to y) = T_\infty(x \to y) + I_\infty(x \to y\|Dx) \tag{27}$$

Transfer entropy is the part of the directed information that measures the causal influence of the past of x onto the 
present of y. However it does not take into account the possible instantaneous dependence of one time series on 
another, which is handled by directed information. 
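As an illustration of eqs. (24) and (25), the following sketch estimates the transfer entropy with one-sample histories (that is, $k = l = 2$ in eq. (24)) by plugging empirical frequencies into the conditional mutual information. The simulated binary model, in which $y_n$ copies $x_{n-1}$ with a 10% flip probability, and the function names are assumptions; a plug-in estimator is only one of several possible estimators and is slightly biased for finite samples.

```python
import numpy as np

def plugin_cmi(a, b, c):
    """Plug-in estimate (bits) of I(a;b|c) from three binary sample arrays."""
    counts = np.zeros((2, 2, 2))
    np.add.at(counts, (a, b, c), 1)          # joint histogram with axes (a, b, c)
    p = counts / counts.sum()

    def h(q):
        q = q[q > 0]
        return float(-(q * np.log2(q)).sum())

    # I(a;b|c) = H(a,c) + H(b,c) - H(a,b,c) - H(c)
    return (h(p.sum(axis=1).ravel()) + h(p.sum(axis=0).ravel())
            - h(p.ravel()) - h(p.sum(axis=(0, 1))))

def transfer_entropy(src, dst):
    """Estimate of I(src_{n-1}; dst_n | dst_{n-1}), cf. eq. (25)."""
    return plugin_cmi(src[:-1], dst[1:], dst[:-1])

# Simulated data (assumed model): x is i.i.d. uniform, y_n = x_{n-1} flipped w.p. 0.1.
rng = np.random.default_rng(0)
n = 20000
x = rng.integers(0, 2, n)
flip = rng.random(n) < 0.1
y = np.zeros(n, dtype=int)
y[1:] = x[:-1] ^ flip[1:]

te_xy = transfer_entropy(x, y)   # close to 1 - h(0.1), about 0.53 bits
te_yx = transfer_entropy(y, x)   # close to 0: no influence of y on x
```

For this model the theoretical values are $1 - h(0.1)$ bits from $x$ to $y$ and zero in the reverse direction, so the estimates mainly differ from them through sampling fluctuations.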

Moreover, only $I(x^{i-1}; y_i|y^{i-1})$ is considered in $T$, instead of its sum over $i$ in the directed information. Thus 
stationarity is implicitly assumed and the transfer entropy has the same meaning as a rate. Summing over $n$ in eq. 
(26), the following decomposition of the directed information is obtained 

$$I(x^n \to y^n) = I(Dx^n \to y^n) + I(x^n \to y^n\|Dx^n) \tag{28}$$

Eq. (28) establishes that the influence of one process on another may be decomposed into two terms accounting 
for the past and for instantaneous contributions respectively. 
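The decomposition (28), together with the symmetric relation (19), can be verified numerically for short sequences. The sketch below uses length-2 binary sequences with a randomly drawn joint pmf; the distribution and the helper names are illustrative assumptions.

```python
import numpy as np

def entropy(p, keep):
    """Entropy (bits) of the marginal of the joint pmf p over the axes in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = np.asarray(p.sum(axis=drop) if drop else p).ravel()
    nz = marginal[marginal > 0]
    return float(-(nz * np.log2(nz)).sum())

def cmi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of axes."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def di(p, src, dst):
    """I(src^n -> dst^n) = sum_i I(src^i; dst_i | dst^{i-1})."""
    return sum(cmi(p, src[:i + 1], dst[i:i + 1], dst[:i]) for i in range(len(dst)))

def delayed_di(p, src, dst):
    """I(D src^n -> dst^n) = sum_i I(src^{i-1}; dst_i | dst^{i-1})."""
    return sum(cmi(p, src[:i], dst[i:i + 1], dst[:i]) for i in range(len(dst)))

def instantaneous(p, src, dst):
    """I(src^n -> dst^n || D src^n) = sum_i I(src_i; dst_i | src^{i-1}, dst^{i-1})."""
    return sum(cmi(p, src[i:i + 1], dst[i:i + 1], src[:i] + dst[:i])
               for i in range(len(dst)))

X, Y = (0, 1), (2, 3)                    # axes of (x1, x2) and (y1, y2)
rng = np.random.default_rng(42)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

lhs = di(p, X, Y)                                     # left-hand side of eq. (28)
rhs = delayed_di(p, X, Y) + instantaneous(p, X, Y)    # past + instantaneous contributions
```

The two sides agree up to rounding, and `instantaneous(p, X, Y)` equals `instantaneous(p, Y, X)`, consistent with the symmetry of the instantaneous information exchange.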
E. Accounting for side information 

The preceding definitions all aim at quantifying information exchange between $x$ and $y$; the 
information that may be gained from connections with the rest of the network is not taken into account. The other 
possibly observed time series are hereafter referred to as side information. The available side information at time 
$n$ is denoted $z^n$. Two conditional quantities are then introduced: conditional directed information and causally 
conditioned directed information. 

$$I(x^n \to y^n|z^n) = H(y^n|z^n) - H(y^n\|x^n|z^n) \tag{29}$$

$$I(x^n \to y^n\|z^n) = H(y^n\|z^n) - H(y^n\|x^n, z^n) \tag{30}$$

where 

$$H(y^n|x^n\|z^n) = H(y^n, x^n\|z^n) - H(x^n\|z^n) \tag{31}$$

$$H(y^n\|x^n|z^n) = \sum_{i=1}^{n} H(y_i|y^{i-1}, x^{i}, z^n) \tag{32}$$

In these equations, following [34], conditioning goes from left to right: the first conditioning type met is the one 
applied. 

Note that for usual conditioning, variables do not need to be synchronous with the others and can have any 
dimension. The synchronicity constraint appears in the new definitions above. 

For conditional directed information, a conservation law similar to eq. (17) holds: 

$$I(x^n \to y^n|z^n) + I(Dy^n \to x^n|z^n) = I(x^n; y^n|z^n) \tag{33}$$

Furthermore, conditional mutual and directed information are equal if and only if 

$$H(x_i|x^{i-1}, y^{i-1}, z^n) = H(x_i|x^{i-1}, z^n) \quad \forall i = 1, \ldots, n \tag{34}$$

which means that given the whole observation of the side information, there is no feedback from $y$ to $x$. Otherwise 
stated, if there is feedback from $y$ to $x$ and if $I(Dy^n \to x^n|z^n) = 0$, the feedback from $y$ to $x$ goes through $z$. 
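The two kinds of conditioning and the conservation law (33) can be made concrete with a small sketch for length-2 binary sequences. Conditional directed information conditions every term on the whole side sequence $z^n$, whereas causally conditioned directed information conditions the $i$-th term on $z^i$ only. The random joint pmf and the function names are assumptions made for illustration.

```python
import numpy as np

def entropy(p, keep):
    """Entropy (bits) of the marginal of the joint pmf p over the axes in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = np.asarray(p.sum(axis=drop) if drop else p).ravel()
    nz = marginal[marginal > 0]
    return float(-(nz * np.log2(nz)).sum())

def cmi(p, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); a, b, c are tuples of axes."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def conditional_di(p, src, dst, side):
    """I(src -> dst | side) = sum_i I(src^i; dst_i | dst^{i-1}, side^n)."""
    return sum(cmi(p, src[:i + 1], dst[i:i + 1], dst[:i] + side)
               for i in range(len(dst)))

def causally_conditioned_di(p, src, dst, side):
    """I(src -> dst || side) = sum_i I(src^i; dst_i | dst^{i-1}, side^i)."""
    return sum(cmi(p, src[:i + 1], dst[i:i + 1], dst[:i] + side[:i + 1])
               for i in range(len(dst)))

def conditional_delayed_di(p, src, dst, side):
    """I(D src -> dst | side) = sum_i I(src^{i-1}; dst_i | dst^{i-1}, side^n)."""
    return sum(cmi(p, src[:i], dst[i:i + 1], dst[:i] + side)
               for i in range(len(dst)))

# Axes of the joint pmf: (x1, x2, y1, y2, z1, z2).
X, Y, Z = (0, 1), (2, 3), (4, 5)
rng = np.random.default_rng(7)
p = rng.random((2,) * 6)
p /= p.sum()

forward = conditional_di(p, X, Y, Z)            # I(x^n -> y^n | z^n)
backward = conditional_delayed_di(p, Y, X, Z)   # I(D y^n -> x^n | z^n)
total = cmi(p, X, Y, Z)                         # I(x^n; y^n | z^n)
causal = causally_conditioned_di(p, X, Y, Z)    # I(x^n -> y^n || z^n)
```

Here `forward + backward` reproduces `total`, as stated by eq. (33); `causal` generally differs from `forward` because it ignores the future of the side information.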

Finally, let us mention that conditioning with respect to some stationary time series $z$ similarly leads to defining 
the causal directed information rate as 

$$I_\infty(x \to y\|z) = \lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} I(x^{i}; y_i|y^{i-1}, z^{i}) \tag{35}$$

$$= \lim_{n \to +\infty} I(x^n; y_n|y^{n-1}, z^n) \tag{36}$$

This concludes the presentation of directed information. We have put emphasis on the importance of Kramer's 
causal conditioning, both for the definition of directed information and for taking into account side information. 
We have also proven that Schreiber's transfer entropy is that part of directed information dedicated to the strict 
sense causal information flow (not accounting for simultaneous coupling). The next section revisits Granger causality 
as another means for assessing influences between time series. 
III. Granger causality between multiple time series 

A. Granger's definition of causality 

Although no universally accepted definition of causality exists, Granger's approach to causality is often 
preferred for two major reasons. The first reason is to be found in the apparent simplicity of the definitions and 
axioms proposed in the early papers, which suggest the following probabilistic approach to causality: $y_n$ is said to 
cause $x_n$ if 

$$\mathrm{Prob}(x_n \in A|\Omega_{n-1}) \neq \mathrm{Prob}(x_n \in A|\Omega_{n-1} \backslash y^{n-1}) \tag{37}$$

for any subset $A$. $\Omega_n$ was called by Granger "all the information available in the universe" at time $n$, whereas 
$\Omega_n \backslash y_n$ stands for all information except $y_n$. In practice, $\Omega_n \backslash (x_n, y_n)$ is the side information $z_n$. 

The second reason is that, in his 1980 paper [24], Granger introduced a set of operational definitions, thus 
allowing practical testing procedures to be derived. These procedures require the introduction of models for testing 
causality, although Granger's approach and definitions are fully general; furthermore, Granger's approach raises the 
important issues below: 

1) Full causality is expressed in terms of probability and leads to relationships between probability density 
functions. Restricting causality to relations defined on mean quantities is less stringent and allows more 
practical approaches. 

2) Instantaneous dependence may be added to the causal relationships, e.g. by adding $y_n$ to the set of observations 
available at time $n - 1$. This leads to a weak concept as it is no longer possible to discriminate between 
instantaneous causation of $x$ by $y$, of $y$ by $x$ or of feedback, at least without imposing extra structure on the 
data models. 

3) It is assumed that $\Omega_n$ is separable: $\Omega_n \backslash y_n$ must be defined. This point is crucial for practical issues: the causal 
relationship between $x_n$ and $y_n$ (if any) is intrinsically related to the set of available knowledge at time $n$. 
Adding a new subset of observations, e.g. a new time series, may lead to different conclusions when testing 
causal dependencies between $x$ and $y$. 

4) If $y_n$ is found to cause $x_n$ with respect to some observation set, this does not preclude the possibility that 
$x_n$ causes $y_n$ if there exists some feedback between the two series. 

Item 2 above motivated Geweke's approaches [20], [21], discussed below. Items 3 and 4 highlight the importance of 
conditioning the information measures on the set of available observations (related to nodes that may be connected 
to either $x$ or $y$), in order to identify causal information flows between any pair of nodes in a multi-connected 
network. As a central purpose of this paper is to relate Granger causality and directed information in the presence of 
side information, the practical point of view suggested by Geweke is adopted. It consists in introducing a linear 
model for the observations. 
B. Geweke's approach 

Geweke proposed measures of (causal) linear dependencies and feedback between two multivariate Gaussian 
time series $x$ and $y$. A third time series $z$ is introduced as side information. This series accounts for the 
influence of other nodes interacting with either $x$ or $y$, as may happen in networks where many different time 
series or multivariate random processes are recorded. The following parametric model is assumed: 

$$x_n = \sum_{s=1}^{\infty} A_{i,s} x_{n-s} + \sum_{s=0}^{\infty} B_{i,s} y_{n-s} + \sum_{s=0}^{b} \alpha_{i,s} z_s + u_{i,t}$$

$$y_n = \sum_{s=0}^{\infty} C_{i,s} x_{n-s} + \sum_{s=1}^{\infty} D_{i,s} y_{n-s} + \sum_{s=0}^{b} \beta_{i,s} z_s + v_{i,t} \tag{38}$$

Accounting for all $z$ corresponds to $b = \infty$ in eq. (38), whereas causally conditioning on $z$ is obtained by setting 
$b = n - 1$. Furthermore, it is assumed that all the processes studied are jointly Gaussian. Thus the analysis can be 
restricted to second order statistics only. 

Under the assumption that the coefficients $\alpha_{i,s}$ and $\beta_{i,s}$ are set to zero (leading back to Geweke's original model), 
we easily see that eq. (38) can actually handle three different dependence models indexed by $i \in \{1, 2, 3\}$, as defined 
below: 

• $i = 1$: no coupling exists, $B_{1,s} = 0$, $C_{1,s} = 0$, $\forall s$ and the prediction residuals $u_{1,t}$ and $v_{1,t}$ are white and 
independent random processes. 

• $i = 2$: both series are only dynamically coupled: $B_{2,0} = 0$, $C_{2,0} = 0$ and the prediction residuals are white 
random processes. Linear prediction properties show that the cross-correlation function of $u_{2,t}$ and 
$v_{2,t}$ is different from zero for the null delay only: $\Gamma_{2,uv}(t) = \sigma^2 \delta(t)$. 

• $i = 3$: the time series are coupled: $B_{3,s} \neq 0$, $C_{3,s} \neq 0$, $\forall s$ and the residuals $u_{3,t}$ and $v_{3,t}$ are white, but are 
no longer independent. 

Note that models 2 and 3 differ only by the presence (model 3) or absence (model 2) of instantaneous coupling. It 
can be shown that these models are 'equivalent' if $\sigma^2 \neq 0$, thus allowing the computation of an invertible linear mapping 
that transforms model 2 into a model of type 3. This confirms that model 3 leads to a weak concept, as already 
noted previously. The same analysis and conclusions hold when the coefficients $\alpha_{i,s}$ and $\beta_{i,s}$ are restored. 

C. Measures of dependence and feedback. 

Geweke [20], [21] introduced dependence measures constructed on the covariances of the residuals $u_{i,t}$ and $v_{i,t}$ 
in (38). We briefly recall these measures. Let 

$$\varepsilon_\infty^2(x_n|x^{n-1}, y^{l}, z^{b}) = \lim_{n \to +\infty} \varepsilon^2(x_n|x^{n-1}, y^{l}, z^{b}) \tag{39}$$

for $l = n$ or $n - 1$ according to the considered model. $\varepsilon_\infty^2(x_n|x^{n-1}, y^{l}, z^{b})$ is the asymptotic variance$^3$ 
of the prediction residual when predicting $x_n$ from the observation 
of $x^{n-1}$, $y^{l}$ and $z^{b}$. For multivariate processes, $\varepsilon^2(\cdot)$ is given by the determinant $\det \Gamma$ of the covariance matrix 
of the residuals. 

$^3$ The presence of $n$ in the notation $\varepsilon_\infty^2(\cdot)$ is an abuse of notation, but is adopted to keep track of the variables involved in this one-step 
forward prediction. 

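Depending on the value of $b$ in (38), and following Geweke, the following measures are proposed for $b = \infty$: 

$$F_{y \to x|z} = \log \frac{\varepsilon_\infty^2(x_n|x^{n-1}, z^{\infty})}{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n-1}, z^{\infty})}$$

$$F_{x \to y|z} = \log \frac{\varepsilon_\infty^2(y_n|y^{n-1}, z^{\infty})}{\varepsilon_\infty^2(y_n|x^{n-1}, y^{n-1}, z^{\infty})}$$

$$F_{x.y|z} = \log \frac{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n-1}, z^{\infty})}{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n}, z^{\infty})} \tag{40}$$

and for causal conditioning, $b = n - 1$: 

$$F_{y \to x\|z} = \log \frac{\varepsilon_\infty^2(x_n|x^{n-1}, z^{n-1})}{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n-1}, z^{n-1})}$$

$$F_{x \to y\|z} = \log \frac{\varepsilon_\infty^2(y_n|y^{n-1}, z^{n-1})}{\varepsilon_\infty^2(y_n|x^{n-1}, y^{n-1}, z^{n-1})}$$

$$F_{x.y\|z} = \log \frac{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n-1}, z^{n-1})}{\varepsilon_\infty^2(x_n|x^{n-1}, y^{n}, z^{n-1})} \tag{41}$$

Note that these measures are greater than or equal to zero. 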
Remarks: 

• $F_{x.y|z}$ and $F_{x.y\|z}$ can be shown to be symmetric with respect to $x$ and $y$ [20], [21]. This is not the case for the 
other measures: if strictly positive, they indicate a direction in the coupling relation. 

• Causally conditional on $z$, $F_{x \to y\|z}$ measures the linear feedback from $x$ to $y$ and $F_{x.y\|z}$ measures the 
instantaneous linear feedback, as introduced by Geweke. 

• In [62], Rissanen and Wax introduce measures which are no longer constructed from the variances of the 
prediction residuals but rather from a quantity of information (measured in bits) that is required for performing 
linear prediction. One cannot afford to deal with infinite order in the regression models, and these approaches 
are equivalent to Geweke's. In [62], the information contained in the model order selection is taken into 
account. We will not develop this aspect in this paper. 
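In practice the measures above are estimated with finite-order regressions. The sketch below estimates the directional measure without side information, $\log\big(\varepsilon^2(y_n|y^{n-1}) / \varepsilon^2(y_n|x^{n-1}, y^{n-1})\big)$, by ordinary least squares with a fixed order. The simulated bivariate model, the order 2, and the function names are all assumptions for illustration; this is not Geweke's own estimation procedure, and order selection is ignored.

```python
import numpy as np

def lagged(series, order):
    """Matrix whose columns are the series delayed by 1, ..., order samples."""
    n = len(series)
    return np.column_stack([series[order - s:n - s] for s in range(1, order + 1)])

def residual_power(target, regressors):
    """Mean squared residual of the least-squares fit of target on regressors."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    resid = target - regressors @ coef
    return float(np.mean(resid ** 2))

def granger_measure(source, target, order=2):
    """log of residual power (own past only) over residual power (own past + source past)."""
    t = target[order:]
    own = lagged(target, order)
    both = np.hstack([own, lagged(source, order)])
    return float(np.log(residual_power(t, own) / residual_power(t, both)))

# Simulated zero-mean Gaussian model (assumed): x drives y with a one-sample delay.
rng = np.random.default_rng(0)
n = 5000
e, w = rng.standard_normal(n), rng.standard_normal(n)
x, y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + e[t]
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + w[t]

f_x_to_y = granger_measure(x, y)   # clearly positive
f_y_to_x = granger_measure(y, x)   # close to zero
```

Because the regressions are nested, both estimates are nonnegative up to rounding; only the measure from $x$ to $y$ is markedly larger than zero, which matches the direction of coupling built into the simulation.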

IV. Directed information and Granger Causality 

We begin by studying the linear Gaussian case, and close the section with a more general discussion. 

A. Gaussian linear models 

Although it is not fully general, the Gaussian case provides interesting insights into directed information. 
Furthermore, it provides a bridge between directed information theory and causal inference in networks, as partly 
described in earlier works [5], [8]. The calculations below are conducted without taking into account observations other than 
$x$ and $y$, as the generalization to the presence of side information is straightforward. 
Let $H(y^k) = \frac{1}{2}\log\left((2\pi e)^k\,|\det \Gamma_{y^k}|\right)$ be the entropy of the $k$-dimensional Gaussian random vector $y^k$ with
covariance matrix $\Gamma_{y^k}$. Using block matrix properties, we have

$$\det \Gamma_{y^k} = \varepsilon^2(y_k|y^{k-1})\,\det \Gamma_{y^{k-1}} \qquad (42)$$

where $\varepsilon^2(y_k|y^{k-1})$ is the linear prediction error of $y$ at time $k$ given its past [11]. Then, the entropy increase is$^4$



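$$H(y^k) - H(y^{k-1}) = \frac{1}{2}\log\left(2\pi e\,\frac{\det \Gamma_{y^k}}{\det \Gamma_{y^{k-1}}}\right) \qquad (43)$$

$$= \frac{1}{2}\log\left(2\pi e\,\varepsilon^2(y_k|y^{k-1})\right) \qquad (44)$$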
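
The determinant factorization and the resulting entropy increment can be checked numerically. The sketch below assumes a scalar stationary AR(1) process with hypothetical parameters `a` and `sigma2`, computes the prediction error as a Schur complement, and keeps the additive constant $\frac{1}{2}\log(2\pi e)$ that comes from the Gaussian entropy formula.

```python
import numpy as np

def ar1_covariance(k, a=0.6, sigma2=1.0):
    """Covariance of k consecutive samples of the stationary process y_n = a*y_{n-1} + w_n."""
    lags = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return sigma2 / (1.0 - a**2) * a**lags

def prediction_error(gamma):
    """Linear prediction error of the last component given the others (Schur complement)."""
    past, cross, var_last = gamma[:-1, :-1], gamma[:-1, -1], gamma[-1, -1]
    return var_last - cross @ np.linalg.solve(past, cross)

def gaussian_entropy(gamma):
    """Differential entropy (in nats) of a zero-mean Gaussian vector with covariance gamma."""
    k = gamma.shape[0]
    _, logdet = np.linalg.slogdet(gamma)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

gamma_k = ar1_covariance(5)
gamma_km1 = gamma_k[:-1, :-1]
eps2 = prediction_error(gamma_k)  # equals the innovation variance sigma2 for an AR(1) process
entropy_increase = gaussian_entropy(gamma_k) - gaussian_entropy(gamma_km1)
```

For this model the prediction error does not depend on `k`, so the entropy increment is already equal to its limit, the entropy rate.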

Let $\varepsilon^2(y_k|y^{k-1},x^k)$ be the power of the linear estimation error of $y_k$ given its past and the observation of $x$ up to
time $k$. Since $I(x^k; y_k|y^{k-1}) = H(y^k) - H(y^{k-1}) - H(x^k, y^k) + H(x^k, y^{k-1})$, the conditional mutual information
and the directed information respectively read

$$I(x^k; y_k|y^{k-1}) = \frac{1}{2}\log\frac{\varepsilon^2(y_k|y^{k-1})}{\varepsilon^2(y_k|y^{k-1},x^k)} \qquad (45)$$

$$I(x^n\to y^n) = \frac{1}{2}\sum_{i=1}^{n}\log\frac{\varepsilon^2(y_i|y^{i-1})}{\varepsilon^2(y_i|y^{i-1},x^i)} \qquad (46)$$
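
The step from the entropy identity to the ratio of prediction errors can be made explicit. The following is a sketch, assuming jointly Gaussian vectors and applying the determinant factorization of eq. (42) both to $y^k$ and to the joint vector $(x^k, y^k)$; the additive constants $\frac{1}{2}\log(2\pi e)$ cancel in the difference.

```latex
\begin{align*}
I(x^k; y_k \mid y^{k-1})
  &= \bigl[H(y^k) - H(y^{k-1})\bigr] - \bigl[H(x^k, y^k) - H(x^k, y^{k-1})\bigr] \\
  &= \tfrac{1}{2}\log\bigl(2\pi e\,\varepsilon^2(y_k \mid y^{k-1})\bigr)
   - \tfrac{1}{2}\log\bigl(2\pi e\,\varepsilon^2(y_k \mid y^{k-1}, x^k)\bigr) \\
  &= \tfrac{1}{2}\log\frac{\varepsilon^2(y_k \mid y^{k-1})}{\varepsilon^2(y_k \mid y^{k-1}, x^k)} .
\end{align*}
```

Summing these terms over the time index gives the directed information.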

If furthermore the vectors considered above are built from jointly stationary Gaussian processes, letting $n \to \infty$
in eq. (46) gives the directed information rate:

$$I_\infty(x\to y) = \frac{1}{2}\log\frac{\varepsilon^2_\infty(y_k|y^{k-1})}{\varepsilon^2_\infty(y_k|y^{k-1},x^k)} \qquad (47)$$

where $\varepsilon^2_\infty(y_k|y^{k-1})$ is the asymptotic power of the one step linear prediction error. Reformulating eq. (47) as

$$\varepsilon^2_\infty(y_k|y^{k-1},x^k) = e^{-2I_\infty(x\to y)}\,\varepsilon^2_\infty(y_k|y^{k-1}) \qquad (48)$$

shows that the directed information rate measures the advantage of including the process $x$ into the prediction of
process $y$.

If side information is available as a time series $z$, and if $x$, $y$ and $z$ are jointly stationary, the same arguments as
above lead to

$$\varepsilon^2_\infty(y_k|y^{k-1},x^k,z^{k-1}) = e^{-2I_\infty(x\to y\|Dz)}\,\varepsilon^2_\infty(y_k|y^{k-1},z^{k-1}) \qquad (49)$$

where we recall that $Dz$ stands for the delayed time series. This equation highlights that causal conditional directed
information has the same meaning as directed information, provided that we are measuring the information gained
by considering $x$ in the prediction of $y$ given its past and the past of $z$.

4 Note that if $y$ is a stationary stochastic process, the limit of the entropy difference in eq. (44) is nothing but the entropy rate. Thus, taking
the limit of eq. (44) exhibits the well known relation between entropy rate and asymptotic one step linear prediction [13].
B. Relations between Granger and Massey's approaches

To relate previous results to Granger causality, the contribution of the past values must be separated from those 
related to instantaneous coupling in the directed information expressions. A natural framework is provided by 
transfer entropy. 

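From eq. (28), under the assumption that the studied processes are jointly Gaussian, arguments similar to those
used in the previous paragraph lead to

$$I(Dx^n\to y^n) = \frac{1}{2}\sum_{i=1}^{n}\log\frac{\varepsilon^2(y_i|y^{i-1})}{\varepsilon^2(y_i|y^{i-1},x^{i-1})} \qquad (50)$$

$$I(x^n\to y^n\|Dx^n) = \frac{1}{2}\sum_{i=1}^{n}\log\frac{\varepsilon^2(y_i|y^{i-1},x^{i-1})}{\varepsilon^2(y_i|y^{i-1},x^{i})} \qquad (51)$$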

Likewise, causally conditioned directed information decomposes as

$$I(x^n\to y^n\|Dz^n) = I(Dx^n\to y^n\|Dz^n) + I(x^n\to y^n\|Dx^n,Dz^n) \qquad (52)$$

Expressing conditional information as a function of prediction error variance we get

$$I(x^n\to y^n\|Dz^n) = \frac{1}{2}\sum_{i=1}^{n}\log\frac{\varepsilon^2(y_i|y^{i-1},z^{i-1})}{\varepsilon^2(y_i|y^{i-1},x^{i-1},z^{i-1})} + \frac{1}{2}\sum_{i=1}^{n}\log\frac{\varepsilon^2(y_i|y^{i-1},x^{i-1},z^{i-1})}{\varepsilon^2(y_i|y^{i-1},x^{i},z^{i-1})}$$

The first term accounts for the influence of the past of x onto y, whereas the second term evaluates the instantaneous 
influence of x on y, provided z is (causally) observed. 

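Finally, letting $n \to \infty$, the following relations between directed information measures and generalized Geweke's
indices are obtained:

$$I_\infty(Dx\to y) = \frac{1}{2}\log\frac{\varepsilon^2_\infty(y_i|y^{i-1})}{\varepsilon^2_\infty(y_i|y^{i-1},x^{i-1})} = \frac{1}{2}F_{x\to y}$$

$$I_\infty(x\to y\|Dx) = \frac{1}{2}\log\frac{\varepsilon^2_\infty(y_i|y^{i-1},x^{i-1})}{\varepsilon^2_\infty(y_i|y^{i-1},x^{i})} = \frac{1}{2}F_{x.y}$$

$$I_\infty(Dx\to y\|Dz) = \frac{1}{2}\log\frac{\varepsilon^2_\infty(y_i|y^{i-1},z^{i-1})}{\varepsilon^2_\infty(y_i|y^{i-1},x^{i-1},z^{i-1})} = \frac{1}{2}F_{x\to y\|z}$$

$$I_\infty(x\to y\|Dx,Dz) = \frac{1}{2}\log\frac{\varepsilon^2_\infty(y_i|y^{i-1},x^{i-1},z^{i-1})}{\varepsilon^2_\infty(y_i|y^{i-1},x^{i},z^{i-1})} = \frac{1}{2}F_{x.y\|z}$$

This proves that for Gaussian processes, directed information rates (causal conditional or not) and Geweke's indices
are in perfect agreement.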

C. Directed Information as a generalized Granger's approach 

These results are obtained under Gaussian assumptions and are closely related to linear prediction theory. However,
the equivalence between Granger's approach and directed information can be extended to a more general framework by
proposing the following information theoretic definitions of causal dependence:

1) $x_t$ is not a cause of $y_t$ with respect to $z_t$ if and only if $I_\infty(Dx\to y\|Dz) = 0$

2) $x_t$ is not instantaneously causal to $y_t$ with respect to $z_t$ if and only if $I_\infty(x\to y\|Dx,Dz) = 0$
These directed information based definitions generalize Granger's approach. Furthermore, these new definitions
allow graphical models to be inferred for multivariate time series: this builds a strong connection between the present
framework and the Granger causality graphs developed by Eichler and Dahlhaus [14]. This connection is further explored
in [7] and in the recent work [58].



V. Application to multivariate Gaussian processes

To illustrate the preceding results, we study the information flow between components of a multivariate Gaussian
process. To stress the importance of causal conditioning and of the availability of side information, we separate
the bivariate analysis from the multivariate analysis. Furthermore, we concentrate on a first order
autoregressive model.

Let $X_n = C X_{n-1} + W_n$ be a multidimensional, stationary, zero-mean Gaussian process. $W_n$ is a multidimensional
Gaussian white noise with correlation matrix $\Gamma_w$ (not necessarily diagonal). The off-diagonal terms in matrix $C$
describe the interactions between the components of $X$: $c_{ij}$ denotes the coupling coefficient from component $i$ to
component $j$. The correlation matrix of $X$ is a solution of the equation

$$\Gamma_X = C\,\Gamma_X\,C^T + \Gamma_w \qquad (53)$$

The main directed information measures are first evaluated on a bivariate process. Then side information is assumed
to be observed, and the same information measures are reconsidered for different coupling models.

A. Bivariate AR(1) model

Let $[v_n, w_n]^T = W_n$, and let $\sigma_v$, $\sigma_w$ be their standard deviations and $\gamma_{vw}$ their correlation coefficient. Let $X_n =
[x_n, y_n]^T$. $\Gamma_X$ is computed by solving eq. (53) as a function of the coupling coefficients between $x_n$ and $y_n$. The
initial condition $(x_1, y_1)$ is assumed to follow the same distribution as $(x_n, y_n)$ to ensure the absence of transients.

Under these assumptions, some computations lead to the following expressions for the mutual and directed information:

(54)

(55)

(56)

where $I(x_1; y_1) = -\frac{1}{2}\log\left(1 - \gamma_{xy}^2/(\sigma_x^2\sigma_y^2)\right)$ and $\gamma_{xy}$ stands for the correlation between $x$ and $y$.
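
Returning to eq. (53), the stationary correlation matrix can also be obtained numerically rather than in closed form. The following is a minimal sketch, assuming a stable coupling matrix (spectral radius below one) and using the vectorization identity $\mathrm{vec}(C\,\Gamma\,C^T) = (C \otimes C)\,\mathrm{vec}(\Gamma)$; the function name and the numerical values are illustrative assumptions.

```python
import numpy as np

def stationary_covariance(C, gamma_w):
    """Solve Gamma = C Gamma C^T + Gamma_w for a stable matrix C."""
    d = C.shape[0]
    # Vectorizing both sides turns the matrix equation into the linear system
    # (I - C kron C) vec(Gamma) = vec(Gamma_w).
    system = np.eye(d * d) - np.kron(C, C)
    vec_gamma = np.linalg.solve(system, gamma_w.reshape(-1))
    return vec_gamma.reshape(d, d)

# Hypothetical bivariate example with X_n = [x_n, y_n]^T: the entry C[0, 1]
# multiplies y_{n-1} in the equation for x_n, and C[1, 0] multiplies x_{n-1}
# in the equation for y_n.
C = np.array([[0.0, 0.5],
              [0.8, 0.0]])
gamma_w = np.array([[1.0, 0.3],
                    [0.3, 1.0]])
gamma_x = stationary_covariance(C, gamma_w)
```

The diagonal of `gamma_x` gives the variances $\sigma_x^2$ and $\sigma_y^2$ and its off-diagonal entry gives $\gamma_{xy}$.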



Equations (54), (55) and (56) raise some comments:



1) The directed information is clearly asymmetric. 
2) On one hand, the left hand side of the conservation equation (19) is given by summing equations (55) and
(56). On the other hand, summing the mutual information (54) and

$$I(x^n\to y^n\|Dx^n) = I(x_1;y_1) + \sum_{i\geq 2} I(x_i; y_i|y^{i-1},x^{i-1}) \qquad (57)$$

$$= I(x_1;y_1) - \frac{n-1}{2}\log\left(1-\frac{\gamma_{vw}^2}{\sigma_v^2\sigma_w^2}\right) \qquad (58)$$

gives as expected the right hand side of the conservation equation (19). We recover the fact that for independent
noise components ($\gamma_{vw} = 0$) the sum of the directed information flowing in opposite directions is equal to
the mutual information. This is however not the case in general.
3) The information rates are obtained by letting $n \to \infty$ in eq. (55), (56):

$$I_\infty(x\to y) = \frac{1}{2}\log\left(\frac{(c_{xy}^2\sigma_x^2+\sigma_w^2)\,\sigma_v^2}{\sigma_v^2\sigma_w^2-\gamma_{vw}^2}\right) \qquad (59)$$

$$I_\infty(y\to x) = \frac{1}{2}\log\left(\frac{(c_{yx}^2\sigma_y^2+\sigma_v^2)\,\sigma_w^2}{\sigma_v^2\sigma_w^2-\gamma_{vw}^2}\right) \qquad (60)$$

This shows that if a coupling is equal to zero in one direction, e.g. $c_{yx} = 0$, the directed
information rate from $y$ to $x$ satisfies

$$I_\infty(y\to x) = \frac{1}{2}\log\left(\frac{\sigma_v^2\sigma_w^2}{\sigma_v^2\sigma_w^2-\gamma_{vw}^2}\right) = \lim_{n\to\infty}\frac{1}{n}I(x^n\to y^n\|Dx^n) \qquad (61)$$

The right hand side of the equality may be interpreted as a lower bound for the directed information rates.
In particular, when $\Gamma_w$ is diagonal, this bound is zero.

This corresponds to the decomposition (28) for the rates, $I_\infty(x\to y) = I_\infty(Dx\to y) + I_\infty(x\to y\|Dx)$.
The first term, $I_\infty(Dx\to y) = \frac{1}{2}\log\left(1+\frac{c_{xy}^2\sigma_x^2}{\sigma_w^2}\right)$, is Schreiber's transfer entropy, or the directed information
from the past of $x$ to $y$ (this term corresponds to the first index of Geweke in this case). The second term
$I_\infty(x\to y\|Dx)$ corresponds to the second of Geweke's indices and measures the instantaneous coupling
between the time series.

4) Directed information increases with the coupling strength, as expected for a measure of information flow. 
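
The decomposition of the rate discussed in comment 3 can be checked numerically. The sketch below assumes that the closed forms read as follows: transfer entropy rate $\frac{1}{2}\log(1 + c_{xy}^2\sigma_x^2/\sigma_w^2)$, instantaneous rate $-\frac{1}{2}\log(1 - \gamma_{vw}^2/(\sigma_v^2\sigma_w^2))$, and their sum written as a single logarithm. The function names and the numerical values are illustrative assumptions, and $\sigma_x^2$ is treated as a free parameter rather than recomputed from the coupling.

```python
import math

def transfer_entropy_rate(c_xy, var_x, var_w):
    """Assumed closed form of I_inf(Dx -> y), in nats."""
    return 0.5 * math.log(1.0 + c_xy**2 * var_x / var_w)

def instantaneous_rate(var_v, var_w, cov_vw):
    """Assumed closed form of I_inf(x -> y || Dx), in nats."""
    return -0.5 * math.log(1.0 - cov_vw**2 / (var_v * var_w))

def directed_information_rate(c_xy, var_x, var_v, var_w, cov_vw):
    """Assumed closed form of I_inf(x -> y), written as a single logarithm."""
    return 0.5 * math.log(var_v * (c_xy**2 * var_x + var_w) / (var_v * var_w - cov_vw**2))

te = transfer_entropy_rate(0.8, 1.5, 1.0)
inst = instantaneous_rate(1.0, 1.0, 0.4)
total = directed_information_rate(0.8, 1.5, 1.0, 1.0, 0.4)
```

With these expressions the total rate is the sum of the two terms, the instantaneous term vanishes for uncorrelated noises, and the transfer entropy term grows with the coupling coefficient.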

B. Multivariate AR( 1 ) model 

Let $X_n = [z_n, x_n, y_n]^T$, driven by the noise $W_n = [u_n, v_n, w_n]^T$, be a three dimensional Gaussian stationary zero mean process
satisfying the AR(1) equation, with the same set of hypotheses and notations as above. We study two cases
described in figure 1, where the arrows indicate the coupling direction.

The distributions of the variables $y_n|y^{n-1}$ and $y_n|y^{n-1},x^{n-1}$ required in the calculation of e.g. $I_\infty(x\to y)$ are
difficult to obtain explicitly. Actually, even if $X$ is a Markov process, the components are not. However, since we
deal with an AR(1) process, $p(y_n|X^{n-1}) = p(y_n|X_{n-1})$ and $p(x_n,y_n|X^{n-1}) = p(x_n,y_n|X_{n-1})$. The goal is to
evaluate $I_\infty(y\to x\|Dz)$. As $I(y^n;x_n|x^{n-1},z^{n-1}) = I(y^{n-1};x_n|x^{n-1},z^{n-1}) + I(y_n;x_n|X^{n-1})$, one has

$$I_\infty(y\to x\|Dz) = \lim_{n\to\infty} I(y^n;x_n|x^{n-1},z^{n-1})$$

$$= \lim_{n\to\infty} I(y^{n-1};x_n|x^{n-1},z^{n-1}) - \frac{1}{2}\log\left(1-\frac{\gamma_{vw}^2}{\sigma_v^2\sigma_w^2}\right) \qquad (62)$$
where $\gamma_{vw}$ is the correlation coefficient between $v_n$ and $w_n$.

In case B (see figure 1), there is feedback from $y$ to $x$. Since conditioning is over the past of $x$ and $z$ and since
there is no feedback from $z$ to $y$, $x_n|(x^{n-1},z^{n-1})$ is normally distributed with variance $c_{yx}^2\sigma_y^2 + \sigma_v^2$. Thus, we obtain
for this case

$$I_{B,\infty}(y\to x\|Dz) = \frac{1}{2}\log\left(1+\frac{c_{yx}^2\sigma_y^2}{\sigma_v^2}\right) - \frac{1}{2}\log\left(1-\frac{\gamma_{vw}^2}{\sigma_v^2\sigma_w^2}\right)$$

Setting $c_{yx} = 0$ we get for case A,

$$I_{A,\infty}(y\to x\|Dz) = -\frac{1}{2}\log\left(1-\frac{\gamma_{vw}^2}{\sigma_v^2\sigma_w^2}\right) \qquad (63)$$

which is the instantaneous exchange rate between $x$ and $y$. If the noise components $v$ and $w$ are independent, the
causal conditional directed information is zero.

The preceding illustration highlights the ability of causal conditioning to deal with different feedback scenarios
in multiply connected stochastic networks. Figure 2 illustrates the inference result and the difference obtained if
the third time series is not taken into account.
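
The effect of ignoring the third time series can be reproduced on simulated data. The following is a minimal sketch, assuming a hypothetical ring network $x \to y \to z \to x$ with independent unit-variance noises (a stand-in for frame A, not the exact model of the paper) and finite-order least-squares estimates of the Geweke-type measures; all names and parameter values are illustrative assumptions.

```python
import numpy as np

def lagged(series, order):
    """Columns series[t-1], ..., series[t-order] aligned with targets t = order, ..., n-1."""
    n = len(series)
    return [series[order - k: n - k] for k in range(1, order + 1)]

def residual_variance(target, regressors):
    """Mean squared residual of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(target))] + list(regressors))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ coef) ** 2)

def geweke(source, target, side=(), order=2):
    """Log ratio of residual variances without and with the past of `source`.
    The past of every series in `side` enters both regressions (causal conditioning)."""
    base = lagged(target, order)
    for s in side:
        base = base + lagged(s, order)
    now = target[order:]
    restricted = residual_variance(now, base)
    full = residual_variance(now, base + lagged(source, order))
    return np.log(restricted / full)

def simulate_ring(n=5000, c=0.8, seed=0):
    """Hypothetical network x -> y -> z -> x: y acts on x only through z."""
    rng = np.random.default_rng(seed)
    u, v, w = rng.standard_normal((3, n))
    x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
    for t in range(1, n):
        z[t] = c * y[t - 1] + u[t]
        x[t] = c * z[t - 1] + v[t]
        y[t] = c * x[t - 1] + w[t]
    return x, y, z

x, y, z = simulate_ring()
pairwise = geweke(y, x)                # y -> x, ignoring z
conditional = geweke(y, x, side=[z])   # y -> x, causally conditioned on z
```

In this sketch the pairwise measure is clearly positive, which would suggest a direct link from $y$ to $x$, whereas the measure conditioned on the past of $z$ is close to zero, in line with the absence of a direct coupling.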

VI. Conclusion 

In this paper, we have revisited the directed information theoretic concept introduced by Massey, Marko and
Kramer. Special attention has been paid to the key role played by causal conditioning. This turns out to be a
central issue for characterizing information flows in the case where side information may be available. We propose
a unified framework to enable a comparative study of mutual information, conditional mutual information and
directed information in the context of networks of stochastic processes. Schreiber's transfer entropy, a widely used
concept in physics and neuroscience, is also shown to be easily interpreted with directed information tools.

The second section describes and discusses Granger causality and its practical issues. Geweke's work serves
as a reference in our discussion, and provides a means to establish that Granger causality and directed
information lead to equivalent measures in the Gaussian linear case. Based upon the previous analysis, a possible
extension of the definition of Granger causality is proposed. The extended definitions rely upon information theoretic
criteria rather than probabilities, and allow Granger's formulation to be recovered in the linear Gaussian case. This new
extended formulation of Granger causality is of some practical importance for estimation issues. Indeed, some
recent works have presented advances in this direction: in [59], directed information estimators are derived from
spike train models; in [72], Kraskov and Leonenko entropy estimators are used for estimating entropy transfer. In
[7], the authors resort to directed information in a graphical modeling context; their equivalence with generative
graphs for analyzing complex systems is studied in [58]. The main contribution of the present paper is to provide
a unified view that allows causality and directed information to be recast within a single framework.

Estimation issues were not addressed in this study, as they may deserve a full paper per se, and are deferred to
future work.
Fig. 1. Networks of three Gaussian processes studied in the paper. An arrow represents a coupling coefficient not equal to zero from the past
of one signal to the other. In frame A, there is no direct feedback between any of the signals. However, a feedback from y to x exists through 
z. In frame B, there is also a direct feedback from y to x. The arrows coming from the outside of the network represent the inputs, i.e. the 
dynamical noise W in the AR model. 




Fig. 2. Networks of three Gaussian processes studied in the paper. The left plot corresponds to the correct model and to the inferred network 
when causal conditional directed information is used. The network on the right is obtained if the analysis is only pairwise, when directed 
information is used between two signals without causal conditioning over the remaining signals. 